Novel similarity-based clustering algorithm for grouping broadcast news
نویسندگان
چکیده
The goal of the current paper is to introduce a novel clustering algorithm that has been designed for grouping transcribed textual documents obtained out of audio, video segments. Since audio transcripts are normally highly erroneous documents, one of the major challenges at the text processing stage is to reduce the negative impacts of errors gained at the speech recognition stage. Other difficulties come from the nature of conversational speech. In the paper we describe the main difficulties of the spoken documents and suggest an approach restricting their negative effects. In our paper we also present a clustering algorithm that groups transcripts on the base of informative closeness of documents. To carry out such partitioning we give an intuitive definition of “informative field of a transcript” and use it in our algorithm. To assess informative closeness of the transcripts, we apply Chi-square similarity measure, which is also described in the paper. Our experiments with Chi-square similarity measure showed its robustness and high efficacy. In particular, the performance analysis that have been carried out in regard to Chi-square and three other similarity measures such as Cosine, Dice, and Jaccard showed that Chi-square is more robust to specific features of spoken documents.
منابع مشابه
Topic detection in broadcast news
We propose a system for the Topic Detection and Tracking (TDT) detection task concerned with the unsupervised grouping of news stories according to topic. We use an incremental k-means algorithm for clustering stories. For comparing stories, we utilize a probabilistic document similarity metric and a traditional vector-space metric. We note that that the clustering algorithm requires two differ...
متن کاملIntegration of evolutionary computation algorithms and new AUTO-TLBO technique in the speaker clustering stage for speaker diarization of broadcast news
The task of speaker diarization is to answer the question "who spoke when?" In this paper, we present different clustering approaches which consist of Evolutionary Computation Algorithms (ECAs) such as Genetic Algorithm (GA), Particle Swarm Optimization (PSO) algorithm, and Differential Evolution (DE) algorithm as well as Teaching-Learning-Based Optimization (TLBO) technique as a new optimizati...
متن کاملAn Optimization K-Modes Clustering Algorithm with Elephant Herding Optimization Algorithm for Crime Clustering
The detection and prevention of crime, in the past few decades, required several years of research and analysis. However, today, thanks to smart systems based on data mining techniques, it is possible to detect and prevent crime in a considerably less time. Classification and clustering-based smart techniques can classify and cluster the crime-related samples. The most important factor in the c...
متن کاملA system for retrieving broadcast news speech documents using voice input keywords and similarity between words
This paper describes a robust speech documents retrieval system that uses voice input keywords. To solve the inevitable problems which arise when the input to the system is speech, i.e. misrecognition, a novel method was developed, where, before the retrieval processing, unproductive keyword candidates are discarded by a grouping processing using the similarity between words and the recognition...
متن کاملA Cluster-based Approach to Broadcast News
We present an approach to detection and tracking of topics in multilingual broadcast news based upon a dynamic clustering scheme. Our approach derives from a system used to filter Web searches from multiple sources, with extensions for pipelining document clusters, part-of-speech tagging and extraction of named entities for use in an extended similarity measure.
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2002